KL Divergence in VAE - Closed form Solution

by Pritish Sunil Jadhav - Sun, 15 Dec 2019
Tags: #python #variational autoencoders #Deep Learning.

A Closed-Form Solution of KL Divergence in Variational Autoencoders -

In this article, I would like to derive the closed-form solution of the KL divergence term used in the cost function of a Variational Autoencoder. The ELBO is given by -

\begin{align} ELBO = E[\ \log \ p(x| z; \theta)\ ] - D_{KL}[\ q(z|x; \phi)\ ||\ p(z)\ ] \tag{1} \end{align}

and,
\begin{align} VAE_{cost} & = -ELBO \\ & = - \big(E[\ \log \ p(x| z; \theta)\ ] - D_{KL}[\ q(z|x; \phi)\ ||\ p(z)\ ]\big) \tag{2} \end{align}

Where,

$p(x| z; \theta)$ - probability distribution of the reconstructed image given the latent representation z, parametrized by $\theta$.
$q(z|x; \phi)$ - probability density function of the latent variables z given input image x, parametrized by $\phi$.
$p(z)$ - marginal probability density of z. In Variational Autoencoders, p(z) is assumed to be a multivariate normal distribution with mean $\mu = 0$ and covariance $\Sigma = I$.

Insights -

  1. For a latent representation z of size k given input x, $q(z | x)$ is a multivariate normal with mean $\mu_1$ and diagonal covariance ${\Sigma}_1$,

where,

$$ \begin{align} \mu_1 = [{\mu_1}^{(1)}, {\mu_1}^{(2)}, {\mu_1}^{(3)}, \ldots, {\mu_1}^{(k)}] \end{align} $$

$$ \begin{align} {\Sigma}_1 = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_k^2 \\ \end{bmatrix} \end{align} $$
  2. The marginal probability of z, $p(z)$, is a multivariate normal with mean 0 and covariance I.
$$ \begin{align} \mu_2 = [0, 0, 0, \ldots, 0] \end{align} $$

$$ \begin{align} {\Sigma}_2 = I_k = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ \end{bmatrix} \end{align} $$
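To make this setup concrete, here is a minimal NumPy sketch of the two distributions (the dimensionality k = 4 and all numeric values are arbitrary toy choices, not outputs of a real encoder):

```python
import numpy as np

k = 4  # dimensionality of the latent representation (arbitrary toy choice)

# q(z|x): multivariate normal with mean mu_1 and diagonal covariance Sigma_1,
# as would be produced by the encoder for a single input x (toy values)
mu_1 = np.array([0.5, -1.0, 0.3, 2.0])
sigma_sq = np.array([0.8, 1.5, 0.2, 1.0])  # per-dimension variances sigma_i^2
Sigma_1 = np.diag(sigma_sq)

# p(z): standard multivariate normal prior with mean 0 and covariance I
mu_2 = np.zeros(k)
Sigma_2 = np.eye(k)
```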

Linear Algebra Properties

  1. $E(x) = E(tr(x))$ if x is a scalar, since the trace of a scalar is the scalar itself
  2. $tr(AB) = tr(BA)$
  3. Cyclic property - $tr(ABC) = tr(CAB) = tr(BCA)$
  4. In general, $tr(ABC) \neq tr(ACB)$
  5. Linearity property - $E(tr(X)) = tr(E(X))$
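These properties are easy to sanity-check numerically; below is a quick NumPy sketch using arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((3, 4, 4))  # three arbitrary 4x4 matrices

# Properties 2 and 3: the trace is invariant under cyclic permutations
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))

# Property 4: a non-cyclic permutation generally changes the trace
print(np.trace(A @ B @ C), np.trace(A @ C @ B))  # generally different

# Property 5: E(tr(X)) = tr(E(X)), checked on a batch of random matrices
X = rng.standard_normal((1000, 4, 4))
assert np.isclose(np.trace(X, axis1=1, axis2=2).mean(), np.trace(X.mean(axis=0)))
```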

Deriving the Closed-Form Solution -

  • Consider two multivariate normal distributions with density functions F(x) and G(x) and means $\mu_1$ and $\mu_2$ respectively.
  • Let the covariances of F(x) and G(x) be given by ${\Sigma}_1$ and ${\Sigma}_2$ respectively.

Then, the KL divergence between F(x) and G(x) is given by -

$D_{KL}(F(x) \ || \ G(x)) = \sum_{x} \big[ F(x) \log \frac{F(x)}{G(x)} \big] \tag{1}$

(For continuous densities the sum over x is really an integral; the summation notation is kept throughout for readability.)

Where,

$F(x) = \frac{1}{\sqrt{(2 \pi)^k |{\Sigma}_1|}} e^{ \frac{-(x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1)}{2} } \tag{2} $

Taking the log on both sides of eq.(2),

$\log F(x) = -\frac{k}{2} \log 2\pi - \frac{1}{2} \log |{\Sigma}_1| - \frac{1}{2}(x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \tag{3}$

Similarly,

$\log G(x) = -\frac{k}{2} \log 2\pi - \frac{1}{2} \log |{\Sigma}_2| - \frac{1}{2}(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \tag{4}$

Therefore,

\begin{align} \log \frac{F(x)}{G(x)} &= \log F(x) - \log G(x) \\ &= \big[ -\frac{k}{2} \log 2\pi - \frac{1}{2} \log |{\Sigma}_1| - \frac{1}{2}(x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] - \big[-\frac{k}{2} \log 2\pi - \frac{1}{2} \log |{\Sigma}_2| - \frac{1}{2}(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big] \\ &= \big[ \require{cancel} \cancel{-\frac{k}{2} \log 2\pi} - \frac{1}{2} \log |{\Sigma}_1| - \frac{1}{2}(x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] + \big[ \cancel{\frac{k}{2} \log 2\pi} + \frac{1}{2} \log |{\Sigma}_2| + \frac{1}{2}(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big] \\ &= \frac{1}{2} \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} - \frac{1}{2} (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \tag{5} \end{align}

From eq.(1) and eq.(5) -

\begin{align} D_{KL}(F(x) \ || \ G(x)) &= \sum_{x} \big[ F(x) \log \frac{F(x)}{G(x)} \big] \\ &= \sum_{x} F(x) \big[ \frac{1}{2} \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} - \frac{1}{2} (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big] \\ &= \frac{1}{2} \sum_{x} F(x) \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} - \frac{1}{2} \sum_{x} F(x) \big[ (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] + \frac{1}{2} \sum_{x} F(x) \big[(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big] \tag{6} \end{align}

We shall evaluate eq.(6) by breaking it into three parts.

Part 1 -

\begin{align} \frac{1}{2} \sum_{x} F(x) \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} = \frac{1}{2} \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} \sum_{x} F(x) \end{align}

We know that $\sum_{x} F(x) = 1$.

\begin{align} \therefore \frac{1}{2} \sum_{x} F(x) \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} = \frac{1}{2} \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} \tag{7} \end{align}

Part 2 -

\begin{align} \frac{1}{2} \sum_{x} F(x) \big[ (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] = \frac{1}{2} E_{F(x)} \big[ (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] \end{align}

Using property 1 from the Linear Algebra Properties (the quadratic form is a scalar),

\begin{align} \frac{1}{2} E_{F(x)} \big[ (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] = \frac{1}{2} E_{F(x)} \big[ Tr\big( (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1)\big) \big] \end{align}

Using the cyclic property (property 3), we get

\begin{align} \frac{1}{2} E_{F(x)} \big[ Tr\big( (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1)\big) \big] = \frac{1}{2} E_{F(x)} \big[ Tr\big( (x - \mu_1)(x - \mu_1)^T {{\Sigma}_1}^{-1} \big) \big] \end{align}

Using the linearity property (property 5), and noting that ${{\Sigma}_1}^{-1}$ is constant with respect to x and can be pulled out of the expectation, we get

\begin{align} \frac{1}{2} E_{F(x)} \big[ Tr\big( (x - \mu_1)(x - \mu_1)^T {{\Sigma}_1}^{-1} \big) \big] &= \frac{1}{2} Tr \big[ E_{F(x)}\big( (x - \mu_1)(x - \mu_1)^T {{\Sigma}_1}^{-1} \big) \big] \\ &= \frac{1}{2} Tr \big[ E_{F(x)}\big( (x - \mu_1)(x - \mu_1)^T \big) {{\Sigma}_1}^{-1} \big] \end{align}

By the definition of covariance, we know that $E_{F(x)}\big( (x - \mu_1)(x - \mu_1)^T \big) = {\Sigma}_1$

\begin{align} \therefore \frac{1}{2} \sum_{x} F(x) \big[ (x - \mu_1)^T {{\Sigma}_1}^{-1} (x - \mu_1) \big] &= \frac{1}{2} Tr \big[ E_{F(x)}\big( (x - \mu_1)(x - \mu_1)^T \big) {{\Sigma}_1}^{-1} \big] \\ &= \frac{1}{2} Tr \big[ {\Sigma}_1 {{\Sigma}_1}^{-1} \big] \\ &= \frac{1}{2} Tr [I_k] \\ &= \frac{1}{2} k \tag{8} \end{align}
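Eq.(8) lends itself to a quick Monte Carlo sanity check; here is a minimal NumPy sketch with arbitrary toy parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4
mu_1 = np.array([0.5, -1.0, 0.3, 2.0])   # toy mean
Sigma_1 = np.diag([0.8, 1.5, 0.2, 1.0])  # toy diagonal covariance

# Sample from F(x) = N(mu_1, Sigma_1) and average the quadratic form
x = rng.multivariate_normal(mu_1, Sigma_1, size=200_000)
d = x - mu_1
quad = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma_1), d)

print(quad.mean() / 2)  # should be close to k/2 = 2.0
```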

Part 3 -

Computing a closed-form solution for part 3 of eq.(6) won't be as straightforward as part 2. Let us start by writing down the equation before deciding on the next steps.

Part 3 = $\frac{1}{2} \sum_{x} F(x) \big[(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big] \tag{9}$

  • Note that we cannot treat $\sum_{x} F(x) \big[(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big]$ the way we treated Part 2, since $\mu_2$ and $\Sigma_2$ parametrize the distribution G(x), not F(x); in particular, $E_{F(x)}\big[(x - \mu_2)(x - \mu_2)^T\big]$ is not the covariance of F(x).
  • To resolve this, we apply the standard trick of adding and subtracting a term, which makes the computation tractable.

Adding and subtracting $\mu_1$ in eq.(9)

\begin{align} part\ 3\ &= \frac{1}{2} \sum_{x} F(x) \big[[(x - \mu_1) + (\mu_1 -\mu_2)]^T {{\Sigma}_2}^{-1} [(x - \mu_1) + (\mu_1 - \mu_2)] \big] \\ &= \frac{1}{2} \sum_{x} F(x) \big[[(x - \mu_1)^T + (\mu_1 -\mu_2)^T] {{\Sigma}_2}^{-1} [(x - \mu_1) + (\mu_1 - \mu_2)] \big] \\ &= \frac{1}{2} \sum_{x} F(x) \big((x - \mu_1)^T {{\Sigma}_2}^{-1} (x - \mu_1) \big) + \frac{2}{2} \sum_{x} F(x) \big((x - \mu_1)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) + \frac{1}{2} \sum_{x} F(x) \big( (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) \tag{10} \end{align}

(The two cross terms from the expansion are equal because each is a scalar and ${{\Sigma}_2}^{-1}$ is symmetric, which is why they combine into a single term with the factor $\frac{2}{2}$.)

We shall further break the computation of eq.(10) into 3 sub-parts as follows (a quick numeric check of the expansion appears after the list) -

  • part 3.1 - $\big( \frac{1}{2} \sum_{x} F(x) (x - \mu_1)^T {{\Sigma}_2}^{-1} (x - \mu_1) \big)$
  • part 3.2 - $\frac{2}{2} \sum_{x} F(x) \big((x - \mu_1)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big)$
  • part 3.3 - $\frac{1}{2} \sum_{x} F(x) \big( (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big)$
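Before evaluating the sub-parts, here is a quick pointwise NumPy check (arbitrary toy values) that the add-and-subtract expansion in eq.(10) is an identity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                # an arbitrary point
mu_1 = np.array([0.5, -1.0, 0.3, 2.0])    # toy means
mu_2 = np.zeros(4)
S2_inv = np.linalg.inv(np.diag([1.0, 2.0, 0.5, 1.5]))  # toy inverse covariance

lhs = (x - mu_2) @ S2_inv @ (x - mu_2)
rhs = ((x - mu_1) @ S2_inv @ (x - mu_1)
       + 2 * (x - mu_1) @ S2_inv @ (mu_1 - mu_2)
       + (mu_1 - mu_2) @ S2_inv @ (mu_1 - mu_2))

assert np.isclose(lhs, rhs)  # the expansion holds pointwise
```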

Part 3.1

\begin{align} part\ 3.1\ &= \big( \frac{1}{2} \sum_{x} F(x) (x - \mu_1)^T {{\Sigma}_2}^{-1} (x - \mu_1) \big) \\ &= \frac{1}{2} E_{F(x)}\big[ (x - \mu_1)^T {{\Sigma}_2}^{-1} (x - \mu_1) \big] \\ &= \frac{1}{2} E_{F(x)}\big[Tr\big((x - \mu_1)^T {{\Sigma}_2}^{-1} (x - \mu_1) \big)\big] \\ &= \frac{1}{2} E_{F(x)}\big[Tr\big( (x - \mu_1)(x - \mu_1)^T {{\Sigma}_2}^{-1} \big)\big] \\ &= \frac{1}{2} Tr\big[E_{F(x)}\big( (x - \mu_1)(x - \mu_1)^T \big) {{\Sigma}_2}^{-1} \big] \\ &= \frac{1}{2} Tr[\Sigma_1 \Sigma_2^{-1}] \tag{11} \end{align}

Part 3.2

\begin{align} part\ 3.2 &= \frac{2}{2} \sum_{x} F(x) \big((x - \mu_1)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) \\ &= \big(\sum_{x} F(x)\, x - \mu_1 \sum_{x} F(x) \big)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \\ &= (\mu_1 - \mu_1)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \\ &= 0 \cdot {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \\ &= 0 \tag{12} \end{align}

where we used $\sum_{x} F(x)\, x = E_{F(x)}[x] = \mu_1$ and $\sum_{x} F(x) = 1$.
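The vanishing cross term of eq.(12) can also be checked empirically (same style of toy parameters as before):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_1 = np.array([0.5, -1.0, 0.3, 2.0])   # toy mean of F(x)
Sigma_1 = np.diag([0.8, 1.5, 0.2, 1.0])  # toy covariance of F(x)
mu_2 = np.zeros(4)                        # mean of G(x)
Sigma_2_inv = np.linalg.inv(np.eye(4))    # inverse covariance of G(x)

# Monte Carlo estimate of the cross term from eq.(12)
x = rng.multivariate_normal(mu_1, Sigma_1, size=200_000)
cross = (x - mu_1) @ Sigma_2_inv @ (mu_1 - mu_2)

print(cross.mean())  # approaches 0 as the sample size grows
```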

Part 3.3

Since the quadratic form is constant with respect to x and $\sum_{x} F(x) = 1$,

\begin{align} \frac{1}{2} \sum_{x} F(x) \big( (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) &= \frac{1}{2} \big( (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) \tag{13} \end{align}

Substituting eq(11), eq(12) and eq(13) in eq(10), we get

\begin{align} \frac{1}{2} \sum_{x} F(x) \big[(x - \mu_2)^T {{\Sigma}_2}^{-1} (x - \mu_2) \big] = \frac{1}{2} Tr[\Sigma_1 \Sigma_2^{-1}] + \frac{1}{2} \big( (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) \tag{14} \end{align}

Putting it all together -

Substituting eq.(7), eq.(8) and eq.(14) into eq.(6), we get

\begin{align} D_{KL}(F(x) \ || \ G(x)) &= \frac{1}{2} \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} - \frac{1}{2} k + \frac{1}{2} Tr[\Sigma_1 \Sigma_2^{-1}] + \frac{1}{2} \big( (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big) \\ &= \frac{1}{2} \big[ \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} - k + Tr[\Sigma_1 \Sigma_2^{-1}] + (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big] \tag{15} \end{align}

Equation 15 gives us a closed-form solution for the KL divergence between two multivariate normal distributions.
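As a sanity check, here is a minimal NumPy implementation of eq.(15), compared against a brute-force Monte Carlo estimate of the KL divergence (the toy parameters and the helper name `kl_mvn` are purely illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def kl_mvn(mu_1, Sigma_1, mu_2, Sigma_2):
    """Closed-form KL divergence between two multivariate normals, eq.(15)."""
    k = mu_1.shape[0]
    Sigma_2_inv = np.linalg.inv(Sigma_2)
    diff = mu_1 - mu_2
    return 0.5 * (np.log(np.linalg.det(Sigma_2) / np.linalg.det(Sigma_1))
                  - k
                  + np.trace(Sigma_1 @ Sigma_2_inv)
                  + diff @ Sigma_2_inv @ diff)

mu_1, Sigma_1 = np.array([0.5, -1.0]), np.diag([0.8, 1.5])
mu_2, Sigma_2 = np.zeros(2), np.eye(2)

# Monte Carlo estimate: E_F[log F(x) - log G(x)]
rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu_1, Sigma_1, size=500_000)
F = multivariate_normal(mu_1, Sigma_1)
G = multivariate_normal(mu_2, Sigma_2)
mc_estimate = (F.logpdf(x) - G.logpdf(x)).mean()

print(kl_mvn(mu_1, Sigma_1, mu_2, Sigma_2), mc_estimate)  # should agree closely
```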

In the last section of this blog post, we shall derive the closed form of the KL divergence used in Variational Autoencoders.

From eq.(15), we know that

\begin{align} D_{KL}(F(x) \ || \ G(x)) = \frac{1}{2} \big[ \log \frac{|{\Sigma}_2|}{|{\Sigma}_1|} - k + Tr[\Sigma_1 \Sigma_2^{-1}] + (\mu_1 -\mu_2)^T {{\Sigma}_2}^{-1} (\mu_1 - \mu_2) \big] \end{align}

For Variational Autoencoders - \begin{align} F(x) = q(z|x) \ \sim N(\mu, \Sigma) \\ G(x) = p(z) \ \sim N(0, I) \end{align}

where,
$q(z|x; \phi)$ - probability density function of the latent variables z given input image x, parametrized by $\phi$.
$p(z)$ - marginal probability density of z. In Variational Autoencoders, p(z) is assumed to be a multivariate normal distribution with mean $\mu = 0$ and covariance $\Sigma = I$.

\begin{align} \therefore D_{KL}(q(z|x) \ || \ p(z)) &= \frac{1}{2} \big[ \log \frac{|I|}{|\Sigma|} - k + Tr[\Sigma I^{-1}] + (\mu - 0)^T I^{-1} (\mu - 0) \big] \\ &= \frac{1}{2} \big[ - \log |\Sigma| - k + Tr[\Sigma] + \mu^T \mu \big] \\ &= \frac{1}{2} \big[ - \sum_{i= 1}^k \big( \log \sigma_{i}^2 +1 \big) + \sum_{i= 1}^k {\sigma_{i}}^2 + \sum_{i= 1}^k {\mu_i}^2 \big] \end{align}
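In practice, the encoder outputs $\mu$ and $\log \sigma^2$ per input, and this final expression is computed elementwise. A minimal NumPy sketch, assuming arrays `mu` and `log_var` of shape `(batch, k)` (the function name and toy values are illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Per-sample KL(q(z|x) || N(0, I)) using the final expression above.

    mu, log_var: encoder outputs of shape (batch, k).
    """
    # 0.5 * sum_i (sigma_i^2 + mu_i^2 - 1 - log sigma_i^2)
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)

# Toy usage: batch of 2 samples with k = 3 latent dimensions
mu = np.array([[0.0, 0.5, -1.0], [1.0, 0.0, 0.2]])
log_var = np.log([[1.0, 0.5, 2.0], [0.3, 1.0, 1.0]])
print(kl_to_standard_normal(mu, log_var))

# Sanity check: KL is 0 when q(z|x) equals the prior N(0, I)
print(kl_to_standard_normal(np.zeros((1, 3)), np.zeros((1, 3))))  # [0.]
```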
